This is a modeling exercise project similar to a Kaggle competition. The goal is to build a predictive model for the Amount variable. The model will be evaluated on the mean absolute error (MAE) between the predicted target and the actual target for the holdout dataset. Predictions are submitted with an ID in the same format as the sample provided in sample.txt. I asked myself: what is the significance of using MAE instead of RMSE? I then looked it up on Google. When I use a Google source, I try to keep track of the link.
MAE is the mean absolute difference between the predictions, \(\hat{y_i}\), and the targets, \(y_i\). This means that positive errors are penalized in the same way as negative errors.
\[\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}{|y_i - \hat{y_i}|}\]
Root mean squared error treats positive and negative errors equally, just like MAE, but imposes a harsher penalty on outliers, observations where there is a large error. This is because it squares the error.
\[\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{(y_i - \hat{y_i})^2}}\] The takeaway is that outliers will be less of an issue in this analysis than if RMSE were used.
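A tiny sketch makes the difference concrete (the numbers here are made up purely for illustration):

```r
# Illustrative only: five observations, one with a large error.
actual    <- c(10, 12, 11, 13, 100)
predicted <- c(11, 11, 12, 12, 12)

mae  <- mean(abs(actual - predicted))       # 18.4: errors count linearly
rmse <- sqrt(mean((actual - predicted)^2))  # ~39.4: the squared outlier dominates
```

Under RMSE the single bad prediction dominates the score; under MAE it contributes no more than its absolute size.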
The data is very ordinary. There are two data files. The train file has a target value, amount, which I am trying to predict. The test file does not have this label and will be used for final evaluation.
train_raw <- read_csv("train.txt")
test_raw <- read_csv("test.txt")
sample_submission <- read_csv("sample.txt")
There is a reasonably large sample size, with 205062 observations in the training set and 51264 in the test set, which is great because most models perform better with more data. A larger n generally reduces the variance of the parameter estimates.
There are 33 features, which are all numeric and unnamed. We also see that there are no missing values, which makes life much easier!
head(train_raw , 10)
## # A tibble: 10 x 35
## ID V1 V2 V3 V4 V5 V6 V7 V8 V9
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 22144 -2.08 -1.63 1.25 -2.17 0.292 -0.956 0.811 0.167 0.756
## 2 251626 2.07 0.0784 -2.13 0.192 0.668 -0.622 0.140 -0.167 0.453
## 3 156283 1.96 -0.185 -0.286 0.330 -0.302 -0.139 -0.601 -0.0294 2.27
## 4 56925 -2.04 -6.35 -0.646 0.920 -3.18 0.946 0.928 -0.0913 -0.277
## 5 271660 -0.826 0.729 1.18 -0.292 -0.563 -0.228 0.342 0.158 0.349
## 6 270117 -0.598 0.993 2.06 0.470 0.307 0.533 0.351 0.216 -0.683
## 7 4429 1.41 -0.668 -0.910 -1.62 1.46 3.27 -1.18 0.712 0.361
## 8 43651 -2.12 0.635 1.56 1.43 -1.39 -0.205 -0.403 0.931 -0.471
## 9 75283 0.885 -1.11 0.602 0.502 -1.03 0.633 -0.691 0.226 -0.691
## 10 45786 0.999 -0.180 1.18 1.07 -0.530 0.953 -0.748 0.477 0.467
## # ... with 25 more variables: V10 <dbl>, V11 <dbl>, V12 <dbl>, V13 <dbl>,
## # V14 <dbl>, V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>,
## # V20 <dbl>, V21 <dbl>, V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>,
## # V26 <dbl>, V27 <dbl>, V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>,
## # V32 <dbl>, V33 <dbl>, Amount <dbl>
For all data manipulations, I will combine the two data sets together in order to ensure equal treatment.
combined <- train_raw %>%
mutate(source = "train") %>%
rbind(test_raw %>% mutate(Amount = NA_real_, source = "test"))
The first item I look at is the distribution of the target, Amount. This is highly right skewed, so I take a log transform. This helps to normalize the shape, which is useful for many modeling applications. We also see that there are significant spikes. I also looked at a Box-Cox transform, which is a more general case of the log transform; in fact, Box-Cox with lambda equal to 0 is the same as the log. When I tested this, the estimated lambda was fairly close to 0, so I just used the log, for simplicity and consistency between the different data sets.
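The Box-Cox check itself didn't make it into the final script; a sketch of the idea looks roughly like this (calling MASS::boxcox directly so that dplyr::select isn't masked; the lambda search window is my choice):

```r
# Hedged sketch: profile the Box-Cox lambda for Amount.
# Amount is shifted by 1 to keep the response strictly positive.
bc <- MASS::boxcox(lm((Amount + 1) ~ 1, data = train_raw),
                   lambda = seq(-0.5, 0.5, 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat  # close to 0, so log(Amount + 1) is a reasonable substitute
```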
This distribution has a very long tail.
train_raw %>%
sample_frac(0.2) %>%
ggplot(aes(Amount)) +
geom_histogram() +
ggtitle("Distribution of Amount") +
xlim(0, 100000)
A key assumption of many linear models is that the errors, and loosely the response, be normally distributed. Taking the log transform brings the distribution closer to normal.
combined <- combined %>% mutate(target = log(Amount + 1))
combined %>%
sample_frac(0.6) %>%
ggplot(aes(sample = target)) +
stat_qq() +
stat_qq_line() +
ggtitle("Empirical Normal Quantiles vs Theoretical Quantiles after Power Transform")
This is better but still not perfect. I notice there is a spike at Amount = 100.100, along with a few other point masses. This is very common in insurance data, often due to deductibles.
combined %>%
group_by(Amount) %>%
summarise(n = n()) %>%
arrange(desc(n))
## # A tibble: 31,117 x 2
## Amount n
## <dbl> <int>
## 1 NA 28480
## 2 100. 12302
## 3 198. 5407
## 4 89.1 4364
## 5 1000. 4272
## 6 1502. 2912
## 7 76.1 2689
## 8 1001 2655
## 9 129. 2608
## 10 179. 2335
## # ... with 31,107 more rows
I create new categorical variables which capture these point masses.
combined <- combined %>%
mutate(spike = as.factor(coalesce(case_when(
Amount %in% c(100.100, 198.198, 89.089, 999.999) ~ as.character(Amount),
Amount < 100.100 ~ "zero"
), "none")))
After removing these, the distribution looks much closer to normal.
combined %>%
sample_frac(0.8) %>%
filter(spike == "none") %>%
ggplot(aes(target)) +
geom_histogram() +
ggtitle("Distribution of Target After Removing 'Deductibles'")
As an extra precaution, I compared the training and test set distributions to see if they are drawn from the same distribution. My strategy was to compare the medians, 1st quartiles, and 3rd quartiles.
They appear to be practically identical.
first_quantile <- function(x){quantile(x, 0.25)}
third_quantile <- function(x){quantile(x, 0.75)}
combined %>%
group_by(source) %>%
select(-Amount, -ID, -target, -spike) %>%
summarise_all(funs(first_quantile, median, third_quantile
)) %>%
gather(feature, stat, -source) %>%
spread(source, stat) %>%
mutate(percent_difference = abs((test - train)/train)) %>%
arrange(desc(percent_difference)) %>%
datatable()
As seen in these graphs, many of these features have long positive and negative tails. They are all centred at zero, and most are roughly symmetric. This saves a lot of time: otherwise, the features would need to be transformed into this shape before using linear models.
combined %>%
select_if(is.numeric) %>%
select(-ID) %>%
sample_frac(0.2) %>%
gather(column, value, 1:10) %>%
ggplot(aes(value)) +
geom_density() +
facet_wrap(vars(column), scales = "free")
combined %>%
select_if(is.numeric) %>%
select(-ID) %>%
sample_frac(0.2) %>%
gather(column, value, 11:20) %>%
ggplot(aes(value)) +
geom_density() +
facet_wrap(vars(column), scales = "free")
V30 through V33 are uniform.
combined %>%
select_if(is.numeric) %>%
select(-ID) %>%
sample_frac(0.2) %>%
gather(column, value, 21:33) %>%
ggplot(aes(value)) +
geom_density() +
facet_wrap(vars(column), scales = "free")
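The uniformity of V30 through V33 can also be checked more formally. A sketch using a Kolmogorov-Smirnov test (note the bounds are estimated from the data itself, so the p-values are only approximate):

```r
# Compare each column's empirical CDF against a fitted Uniform(min, max).
sapply(c("V30", "V31", "V32", "V33"), function(v) {
  x <- train_raw[[v]]
  ks.test(x, "punif", min(x), max(x))$p.value
})
```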
Coincidentally, we see that the features are already ordered by their standard deviations!
combined %>%
select_if(is.numeric) %>%
select(-ID) %>%
summarise_all(sd) %>%
gather(feature, value) %>%
arrange(desc(value))
## # A tibble: 35 x 2
## feature value
## <chr> <dbl>
## 1 V1 1.96
## 2 V2 1.65
## 3 V3 1.52
## 4 V4 1.42
## 5 V5 1.38
## 6 V6 1.33
## 7 V7 1.24
## 8 V8 1.19
## 9 V9 1.10
## 10 V10 1.09
## # ... with 25 more rows
The main difficulty with this data is the length of some of the tails. I looked at univariate statistics to see if there was a pattern. Here I define an outlier as a value above the 99.9th percentile or below the 0.1st percentile.
IQR_outlier <- function(input_column, alpha = 0.001){
x <- unlist(input_column)
#flag values below the alpha quantile or above the (1 - alpha) quantile
upper_bound <- quantile(x, 1 - alpha)
lower_bound <- quantile(x, alpha)
is_outlier <- (x < lower_bound | x > upper_bound)
percent_outlier <- mean(is_outlier, na.rm = TRUE)
return(percent_outlier)
}
When running this over all columns, we again coincidentally see that every column has the same proportion of outliers.
train_raw %>%
select(-ID) %>%
map_df(IQR_outlier, alpha = 0.001) %>%
gather(column, percent_univariate_outlier) %>%
arrange(desc(percent_univariate_outlier)) %>%
datatable()
At this point I strongly suspected that the data was simulated. If so, maybe there was a pattern to the outliers. Because these cause difficulty with linear models later on, I spent some time looking at them.
As seen below, some of the points are extreme in many different dimensions. Observation number 118164, for example, is above the 99.99th percentile or below the 0.01st percentile in 21 out of 33 dimensions.
outlier_summary <- train_raw %>%
select(-ID) %>%
map_df(function(input_column, alpha = 0.0001){
x <- unlist(input_column)
#flag values below the alpha quantile or above the (1 - alpha) quantile
upper_bound <- quantile(x, 1 - alpha)
lower_bound <- quantile(x, alpha)
is_outlier <- (x < lower_bound | x > upper_bound)
return(is_outlier)
}) %>%
mutate(obs_number = row_number()) %>%
gather(feature, outlier, -obs_number) %>%
arrange(desc(obs_number)) %>%
group_by(obs_number) %>%
summarise(total_outliers = sum(outlier)) %>%
arrange(desc(total_outliers)) %>%
filter(total_outliers>0)
head(outlier_summary)
## # A tibble: 6 x 2
## obs_number total_outliers
## <int> <int>
## 1 118164 21
## 2 117427 15
## 3 18186 12
## 4 185442 11
## 5 215346 11
## 6 247628 11
I try looking at these outliers within the data to see if there is a visible pattern. The size of each point represents the number of dimensions in which it falls outside the quantile bounds. The observation with ID 274772 is guilty in more than 20 dimensions.
train_raw %>%
mutate(obs_number = row_number()) %>%
dplyr::slice(outlier_summary$obs_number) %>%
left_join(outlier_summary, by = c("obs_number")) %>%
mutate(total_outliers_label = ifelse(total_outliers > 10, ID, NA)) %>%
ggplot(aes(V2, V6, size = total_outliers)) +
geom_point(color='dodgerblue') +
geom_text(aes(label = total_outliers_label, hjust=1, vjust=1)) +
ggtitle("High Outlying Points Across Multiple Dimensions")
Again, 274772 shows up, only this time in V7 and V21.
train_raw %>%
mutate(obs_number = row_number()) %>%
dplyr::slice(outlier_summary$obs_number) %>%
left_join(outlier_summary, by = c("obs_number")) %>%
mutate(total_outliers_label = ifelse(total_outliers > 10, ID, NA)) %>%
ggplot(aes(V7, V21, size = total_outliers)) +
geom_point(color='dodgerblue') +
geom_text(aes(label = total_outliers_label, hjust=1, vjust=1)) +
ggtitle("High Outlying Points Across Multiple Dimensions")
train_raw %>%
mutate(obs_number = row_number()) %>%
dplyr::slice(outlier_summary$obs_number) %>%
left_join(outlier_summary, by = c("obs_number")) %>%
mutate(total_outliers_label = ifelse(total_outliers > 10, ID, NA)) %>%
ggplot(aes(V11, V28, size = total_outliers)) +
geom_point(color='dodgerblue') +
geom_text(aes(label = total_outliers_label, hjust=1, vjust=1)) +
ggtitle("Do you really expect to hide, 274772?")
We can look at the asymmetry of each feature via its skewness. Let's look at the top 10 most skewed features in more detail.
skewed_features <- combined %>%
select_if(is.numeric) %>%
select(-ID) %>%
summarise_all(skewness) %>%
gather(feature, skew) %>%
mutate(abs_skew = abs(skew)) %>%
arrange(desc(abs_skew)) %>%
select(feature) %>%
unlist() %>%
as.character()
combined %>%
select(skewed_features[1:10]) %>%
gather(column, value) %>%
ggplot(aes(value)) +
geom_density() +
facet_wrap(vars(column), scales = "free")
train_raw %>%
select(skewed_features[1:10]) %>%
dplyr::slice(outlier_summary$obs_number) %>%
gather(column, value) %>%
ggplot(aes(value)) +
geom_density() +
facet_wrap(vars(column), scales = "free")
These features appear to be uncorrelated, with the exception of V15 and V29.
correlation <- combined %>%
select_if(is.numeric) %>%
select(-ID) %>%
cor()
corrplot(correlation,
type = "upper")
V15 is a duplicate of V29. After dropping V29, let's look again at the correlations.
combined %>%
sample_frac(0.2) %>%
ggplot(aes(V15, V29)) +
geom_point() +
ggtitle("V15 is a duplicate of V29")
correlation <- train_raw %>%
select_if(is.numeric) %>%
mutate(target = log(Amount + 1)) %>%
select(-ID, -V29, -Amount) %>%
cor()
corrplot(correlation,
type = "upper")
Which features have the highest correlation with the target Amount?
top_5_corr_with_amount <- data_frame(feature = rownames(correlation), corr_with_target = as.numeric(correlation[, 'target'])) %>%
filter(corr_with_target != 1) %>%
mutate(abs_corr_with_amount = abs(corr_with_target)) %>%
arrange(desc(abs_corr_with_amount)) %>%
top_n(5) %>%
select(feature) %>%
unlist() %>%
as.character()
train_raw %>%
sample_frac(0.1) %>%
mutate(target = log(Amount + 1)) %>%
select(top_5_corr_with_amount, target) %>%
ggpairs()
PCA is a way of reducing the dimensions of a matrix by finding linear combinations of the features which explain most of the variance.
As seen below, the first principal component explains the largest share of the variance, though not by much.
model_matrix <- train_raw %>% dplyr::select(-Amount)
pca <- prcomp(model_matrix, scale = T, center = T)
plot(pca, type = "l")
The first PC explains 6% of the total variance. The other components each explain almost exactly 3%. This again suggests that the data was simulated: the features are almost perfectly uncorrelated. If they were perfectly uncorrelated (after scaling), each component would explain 1/33 of the variance, which comes out to about 0.0303.
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.44501 1.25170 1.02203 1.00889 1.00715 1.00669
## Proportion of Variance 0.06141 0.04608 0.03072 0.02994 0.02983 0.02981
## Cumulative Proportion 0.06141 0.10749 0.13822 0.16815 0.19799 0.22779
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 1.00541 1.00480 1.00350 1.00295 1.00290 1.00235
## Proportion of Variance 0.02973 0.02969 0.02962 0.02959 0.02958 0.02955
## Cumulative Proportion 0.25752 0.28722 0.31684 0.34642 0.37601 0.40556
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 1.0015 1.00124 1.00111 1.00076 1.00042 1.00009
## Proportion of Variance 0.0295 0.02948 0.02948 0.02946 0.02944 0.02942
## Cumulative Proportion 0.4351 0.46454 0.49402 0.52347 0.55291 0.58233
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 0.9998 0.99933 0.99921 0.99888 0.9981 0.99782
## Proportion of Variance 0.0294 0.02937 0.02937 0.02935 0.0293 0.02928
## Cumulative Proportion 0.6117 0.64110 0.67046 0.69981 0.7291 0.75839
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 0.99669 0.99613 0.99493 0.99370 0.9912 0.98988
## Proportion of Variance 0.02922 0.02918 0.02911 0.02904 0.0289 0.02882
## Cumulative Proportion 0.78761 0.81679 0.84591 0.87495 0.9038 0.93266
## PC31 PC32 PC33 PC34
## Standard deviation 0.98746 0.98415 0.58808 8.85e-15
## Proportion of Variance 0.02868 0.02849 0.01017 0.00e+00
## Cumulative Proportion 0.96134 0.98983 1.00000 1.00e+00
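Near-equal variance shares are exactly what we would expect from independent, standardized features. A quick simulation with made-up independent normal data illustrates:

```r
# 33 independent standard-normal features: every PC explains ~1/33.
set.seed(1)
sim <- matrix(rnorm(10000 * 33), ncol = 33)
pve <- prcomp(sim, scale. = TRUE)$sdev^2 / 33
range(pve)  # both ends close to 1/33 ~ 0.0303
```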
train_raw %>%
sample_frac(0.2) %>%
mutate(target = log(Amount + 1)) %>%
select(2:6, target) %>%
ggpairs(mapping = aes(fill = target), progress = F)
train_raw %>%
sample_frac(0.1) %>%
mutate(target = log(Amount + 1)) %>%
select(7:12, target) %>%
ggpairs(mapping = aes(fill = target), progress = F)
These variables are useless. Don’t use these.
train_raw %>%
sample_frac(0.1) %>%
mutate(target = log(Amount + 1)) %>%
select(V13, V28, V29, V30, V31, target) %>%
ggpairs(mapping = aes(fill = target), progress = F)
This file will be where I do the model building.
We will split the train data into a smaller training set and a validation set. What is a good split? With less training data, there is more variance in the parameter estimates; with a smaller validation set, there is more variance in the performance estimates. The goal is to divide the data so there is a good balance between the two.
The process will be:
1. Split train into 80% training and 20% validation.
2. Prepare the data for training and testing.
3. Predict on test.txt for the final submission.
#Remember to exponentiate and subtract 1 from predictions before submitting!!
df <- combined %>% select( -V29, -ID, -spike)
data_for_training <- df %>% filter(source == "train") %>% select(-source, -Amount)
data_for_predictions <- df %>% filter(source == "test") %>% select(-source, -Amount)
train_index <- createDataPartition(data_for_training$target, p = 0.8, list = FALSE) %>% as.numeric()
train <- data_for_training %>% dplyr::slice(train_index) %>% select(target, everything())
test <- data_for_training %>% dplyr::slice(-train_index) %>% select(target,everything())
#some models need a matrix instead of a data frame
train_x <- train %>% select(-target) %>% as.matrix()
train_y <- train$target
holdout_x <- data_for_predictions %>% dplyr::select(-target) %>% as.matrix()
#some models need a matrix instead of a data frame
test_x <- test %>% select(-target) %>% as.matrix()
test_y <- test$target
saveRDS(train_x, "train_x.RDS")
saveRDS(train_y, "train_y.RDS")
This is the data excluding the spikes. I thought about fitting a model for the "deductibles" separately. Using "Is_Spike" as a target, I built a model with 88% accuracy but never used it. As you can imagine, this could have been stacked on top of a regression model.
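That spike classifier isn't part of the final pipeline, but a sketch of how it might look (this is an assumption of mine, not the exact model that reached 88% accuracy):

```r
# Hedged sketch: classify whether an observation lands on a spike.
spike_data <- combined %>%
  filter(source == "train") %>%
  mutate(is_spike = factor(ifelse(spike == "none", "no", "yes"))) %>%
  select(-ID, -V29, -Amount, -target, -spike, -source)

spike_model <- train(is_spike ~ ., data = spike_data,
                     method = "glm", family = "binomial",
                     trControl = trainControl(method = "cv", number = 5))
```

Its predicted spike probabilities could then feed a second-stage regression, as mentioned above.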
df <- combined %>% filter(spike == "none") %>% select( -Amount, -V29, -ID)
data_for_training <- df %>% filter(source == "train") %>% select(-source, -spike)
data_for_predictions <- df %>% filter(source == "test") %>% select(-source, -spike)
train_index <- createDataPartition(data_for_training$target, p = 0.8, list = FALSE) %>% as.numeric()
train_no_spikes <- data_for_training %>% dplyr::slice(train_index) %>% select(target, everything())
test_no_spikes <- data_for_training %>% dplyr::slice(-train_index) %>% select(target, everything())
I fit a baseline model before doing any transformations. This gives me a reference point to assess whether later changes are real improvements, rather than chasing the "irreducible" error. I tried not to look at the test error when tuning parameters, to avoid human-based overfitting.
eval_model <- function(input_model, input_data = "train") {
if(input_data == "train"){
model_prediction <- predict.train(input_model, train_x)
df <- postResample(pred = model_prediction, obs = train_y)}
else {
model_prediction <- predict.train(input_model, test_x)
df <- postResample(pred = model_prediction, obs = test_y)}
data_frame("Data" = input_data, 'RMSE' = df[1], 'Rsquared' = df[2], 'MAE' = df[3])
}
Fit a baseline model with the features which have the highest correlation with Amount.
regressControl <- trainControl(method="repeatedcv",
number = 5,
repeats = 1, #set this to 1 for now
returnResamp = "all"
)
baseline <- train(target ~ V2 + V7 + V5,
data = train[-94590,],
method = "lm",
trControl = regressControl)
plot(baseline$finalModel)
eval_model(baseline)
## # A tibble: 1 x 4
## Data RMSE Rsquared MAE
## <chr> <dbl> <dbl> <dbl>
## 1 train 1.68 0.259 1.31
summary(baseline)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.157 -0.958 0.251 1.231 11.845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.568570 0.003708 2041.13 <2e-16 ***
## V2 -0.500492 0.002261 -221.37 <2e-16 ***
## V7 0.165245 0.003088 53.51 <2e-16 ***
## V5 -0.399248 0.002744 -145.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.679 on 205057 degrees of freedom
## Multiple R-squared: 0.2592, Adjusted R-squared: 0.2592
## F-statistic: 2.392e+04 on 3 and 205057 DF, p-value: < 2.2e-16
Drop the high-leverage outliers from the training data.
train <- train %>%
dplyr::slice(-c(14568, 172386, 200929))
Fitting a model with all predictors first, we see that V28 through V31 do not appear to be significant.
regressControl <- trainControl(method="repeatedcv",
number = 10,
repeats = 3,
returnResamp = "all"
)
GLM_all <- train(target ~ ., #use all predictors
data = train,
method = "lm",
trControl = regressControl)
plot(GLM_all$finalModel)
eval_model(GLM_all)
## # A tibble: 1 x 4
## Data RMSE Rsquared MAE
## <chr> <dbl> <dbl> <dbl>
## 1 train 1.58 0.341 1.21
Based on the p-values being greater than 0.05, I drop the predictors V13, V28, and V30 through V33 from the model.
regressControl <- trainControl(method="repeatedcv",
number = 10,
repeats = 3,
returnResamp = "all"
)
GLM_best <- train(target ~ ., #all predictors except those dropped below
data = train %>% select(-V28, -V30, -V31, -V33, -V13, -V32) ,
method = "lm",
trControl = regressControl)
plot(GLM_best$finalModel)
eval_model(GLM_best)
## # A tibble: 1 x 4
## Data RMSE Rsquared MAE
## <chr> <dbl> <dbl> <dbl>
## 1 train 1.58 0.341 1.21
summary(GLM_best)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.188 -0.890 0.254 1.136 12.857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.569371 0.003497 2164.472 < 2e-16 ***
## V1 -0.086798 0.001791 -48.467 < 2e-16 ***
## V2 -0.501056 0.002133 -234.957 < 2e-16 ***
## V3 -0.039987 0.002320 -17.234 < 2e-16 ***
## V4 -0.030021 0.002474 -12.132 < 2e-16 ***
## V5 -0.399803 0.002591 -154.280 < 2e-16 ***
## V6 0.237545 0.002646 89.786 < 2e-16 ***
## V7 0.164891 0.002918 56.513 < 2e-16 ***
## V8 -0.028754 0.002932 -9.807 < 2e-16 ***
## V9 -0.140543 0.003190 -44.052 < 2e-16 ***
## V10 -0.018265 0.003230 -5.655 1.56e-08 ***
## V11 -0.082116 0.003433 -23.918 < 2e-16 ***
## V12 -0.032781 0.003511 -9.336 < 2e-16 ***
## V14 0.049866 0.003667 13.600 < 2e-16 ***
## V15 -0.112465 0.003824 -29.413 < 2e-16 ***
## V16 -0.204250 0.003998 -51.091 < 2e-16 ***
## V17 0.040542 0.004148 9.773 < 2e-16 ***
## V18 0.095243 0.004178 22.795 < 2e-16 ***
## V19 -0.016562 0.004293 -3.858 0.000114 ***
## V20 0.313050 0.004612 67.879 < 2e-16 ***
## V21 0.203565 0.004790 42.496 < 2e-16 ***
## V22 0.099897 0.004815 20.749 < 2e-16 ***
## V23 -0.068558 0.005577 -12.294 < 2e-16 ***
## V24 -0.040318 0.005775 -6.981 2.94e-12 ***
## V25 -0.018760 0.006704 -2.798 0.005135 **
## V26 -0.082478 0.007256 -11.367 < 2e-16 ***
## V27 -0.156191 0.008837 -17.675 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.584 on 205032 degrees of freedom
## Multiple R-squared: 0.3412, Adjusted R-squared: 0.3411
## F-statistic: 4084 on 26 and 205032 DF, p-value: < 2.2e-16
regressControl <- trainControl(method="repeatedcv",
number = 10,
repeats = 3,
returnResamp = "all"
)
GLM_all_no_spike <- train(target ~ .,
data = train_no_spikes %>% select(-V28, -V30, -V31, -V33, -V13, -V32),
method = "lm",
trControl = regressControl)
summary(GLM_all_no_spike)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.620 -0.774 0.138 0.897 8.564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.993034 0.002878 2776.913 < 2e-16 ***
## V1 -0.061505 0.001470 -41.827 < 2e-16 ***
## V2 -0.413670 0.001715 -241.180 < 2e-16 ***
## V3 -0.073464 0.001940 -37.875 < 2e-16 ***
## V4 0.024382 0.001998 12.201 < 2e-16 ***
## V5 -0.296226 0.002044 -144.916 < 2e-16 ***
## V6 0.172774 0.002130 81.114 < 2e-16 ***
## V7 0.103964 0.002332 44.586 < 2e-16 ***
## V8 -0.054906 0.002554 -21.500 < 2e-16 ***
## V9 -0.078418 0.002604 -30.109 < 2e-16 ***
## V10 -0.078368 0.002767 -28.327 < 2e-16 ***
## V11 -0.050627 0.002845 -17.794 < 2e-16 ***
## V12 0.002081 0.002910 0.715 0.474648
## V14 0.011834 0.003212 3.684 0.000229 ***
## V15 -0.070650 0.003102 -22.777 < 2e-16 ***
## V16 -0.165386 0.003214 -51.451 < 2e-16 ***
## V17 0.024038 0.003596 6.685 2.31e-11 ***
## V18 0.103477 0.003380 30.615 < 2e-16 ***
## V19 -0.044662 0.003449 -12.951 < 2e-16 ***
## V20 0.311606 0.003627 85.910 < 2e-16 ***
## V21 0.175724 0.004094 42.919 < 2e-16 ***
## V22 0.121305 0.003951 30.704 < 2e-16 ***
## V23 -0.093028 0.004370 -21.287 < 2e-16 ***
## V24 -0.045230 0.004694 -9.635 < 2e-16 ***
## V25 0.080241 0.005450 14.723 < 2e-16 ***
## V26 -0.084623 0.005805 -14.578 < 2e-16 ***
## V27 -0.209188 0.007070 -29.588 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.197 on 175380 degrees of freedom
## Multiple R-squared: 0.3987, Adjusted R-squared: 0.3986
## F-statistic: 4473 on 26 and 175380 DF, p-value: < 2.2e-16
eval_model(GLM_all_no_spike)
## # A tibble: 1 x 4
## Data RMSE Rsquared MAE
## <chr> <dbl> <dbl> <dbl>
## 1 train 1.66 0.329 1.24
plot(GLM_all_no_spike$finalModel)
We look at the predictions against the target.
test_no_spikes %>%
mutate(predictions = predict.train(GLM_all_no_spike, test_no_spikes)) %>%
ggplot(aes(target, predictions)) +
geom_point() +
ggtitle("Predictions vs Target for GLM excluding spikes")
To deal with non-linear relationships, we need a more flexible model: gradient boosted trees via xgboost.
We'll break down the tuning of its parameters into five sections:
1. eta and the number of iterations nrounds
2. max_depth and child weight min_child_weight
3. colsample_bytree and row sampling subsample
4. gamma values
5. eta (refined)
I can create a diagram of what a tree looks like using the DiagrammeR package.
dummy_xgb <- xgboost(data = train_x, label = train_y, max.depth = 2,
eta = 1, nthread = 2, nround = 2, objective = "reg:linear")
## [1] train-rmse:1.661409
## [2] train-rmse:1.589086
xgb.plot.tree(colnames(train_x), model = dummy_xgb)
As a baseline to the xgboost, we’ll first fit using the default parameters.
grid_default <- expand.grid(
nrounds = 100,
max_depth = 3,
eta = 0.3,
gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1
)
train_control <- caret::trainControl(
method = "none",
verboseIter = TRUE, # print training log
allowParallel = TRUE # FALSE for reproducible results
)
#avoid needing to refit model when knitting doc
# xgb_base <- caret::train(
# x = train_x,
# y = train_y,
# trControl = train_control,
# tuneGrid = grid_default,
# method = "xgbTree",
# eval_metric = "mae"
# )
#saveRDS(xgb_base, "Models/xgb_base.RDS")
xgb_base <- readRDS("Models/XGB_base.RDS")
eval_model(xgb_base)
## # A tibble: 1 x 4
## Data RMSE Rsquared MAE
## <chr> <dbl> <dbl> <dbl>
## 1 train 1.10 0.686 0.772
We'll start with the "bigger knobs" and use those settings to find the best of the "smaller knobs", then come back and refine the more significant parameters. We start by fixing the number of trees, which controls the total number of regression trees used. This is selected in combination with the learning rate: a lower learning rate updates the predictions more slowly and so requires a larger number of iterations, nrounds, to minimize the loss function, while setting the learning rate too high eventually leads to instability. Using more trees with a lower learning rate is almost always better, but has diminishing returns. To reduce compute time while choosing the other parameters, we cap this at 500; after the other parameters have been chosen, we will come back and turn this up.
nrounds <- 500
Then we can fill in the other items, using suggestions from [here](https://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions/14).
t1 <- Sys.time()
tune_grid <- expand.grid(
nrounds = seq(from = 50, to = 500, by = 100),
eta = c(0.1, 0.2, 0.4),
max_depth = 3,
gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1
)
tune_control <- caret::trainControl(
method = "cv", # cross-validation
number = 3, # with n folds
verboseIter = FALSE, # no training log
allowParallel = TRUE # FALSE for reproducible results
)
# xgb_tune <- caret::train(
# x = train_x,
# y = train_y,
# trControl = tune_control,
# tuneGrid = tune_grid,
# method = "xgbTree",
# verbose = TRUE
# )
t2 <- Sys.time()
timediff <- t2 - t1
# saveRDS(xgb_tune, "Models/xgb_tune.RDS")
xgb_tune <- readRDS("Models/XGB_tune.RDS")
plot(xgb_tune)
From the plots above, we see that the best learning rate eta is 0.4, which is high for a starting value. GBMs generally perform better with a larger number of trees, but take longer to train. To make this faster, we'll keep the number of trees modest while tuning the other parameters.
Next, we move on to finding a good value for the max tree depth. We start with 3 +/- 1. The maximum depth controls the depth or "height" of each tree and helps to avoid overfitting: a higher depth can capture interaction effects better, but setting it too high will overfit the training set. We also try higher values of min_child_weight, which controls the minimum number of observations allowed in each end node; higher values prevent overfitting.
The code below took 30 minutes to fit.
t1 <- Sys.time()
tune_grid2 <- expand.grid(
nrounds = c(150, 300, 500),
eta = xgb_tune$bestTune$eta,
max_depth = c(2, 3, 4),
gamma = 0,
colsample_bytree = 1,
min_child_weight = c(1, 2, 3),
subsample = 1
)
# xgb_tune2 <- caret::train(
# x = train_x,
# y = train_y,
# trControl = tune_control,
# tuneGrid = tune_grid2,
# method = "xgbTree",
# verbose = TRUE
# )
timediff <- Sys.time() - t1
xgb_tune2 <- readRDS("Models/XGB_tune2.RDS")
# saveRDS(xgb_tune2, "Models/xgb_tune2.RDS")
eval_model(xgb_tune2)
## # A tibble: 1 x 4
## Data RMSE Rsquared MAE
## <chr> <dbl> <dbl> <dbl>
## 1 train 0.784 0.839 0.520
plot(xgb_tune2)
xgb_tune2$bestTune
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 24 500 4 0.4 0 1 2 1
We see that the best max depth is 4 with a min_child_weight of 2.
Finally, now that the tree parameters are tuned, I go back and tune the final boosting parameters.
Unfortunately, this overfit: although the train MAE is lower, the test MAE is higher. In a perfect world, this could be rerun with a larger number of trees to further improve the performance.
t1 <- Sys.time()
tune_grid3 <- expand.grid(
nrounds = c(600, 900, 1200),
eta = c(0.08, 0.1, 0.3),
max_depth = 4,
gamma = 0,
colsample_bytree = 1,
min_child_weight = 2,
subsample = c(1, 0.8)
)
# xgb_tune3 <- caret::train(
# x = train_x,
# y = train_y,
# trControl = tune_control,
# tuneGrid = tune_grid3,
# method = "xgbTree",
# verbose = TRUE
# )
timediff <- Sys.time() - t1
# saveRDS(xgb_tune3, "Models/xgb_tune3.RDS")
xgb_tune3 <- readRDS("Models/XGB_tune3.RDS")
plot(xgb_tune3)
eval_model(xgb_tune3)
## # A tibble: 1 x 4
## Data RMSE Rsquared MAE
## <chr> <dbl> <dbl> <dbl>
## 1 train 0.699 0.872 0.460
What does the error look like against each incremental tuning step?
xgb_models <- list(xgb_base, xgb_tune, xgb_tune2, xgb_tune3)
xgb_models %>%
map_df(eval_model) %>%
mutate(GBM_tuning_step = 1:4) %>%
rbind(
xgb_models %>%
map_df(., .f = eval_model, "test") %>%
mutate(GBM_tuning_step = 1:4)
) %>% datatable()
Remember to reverse the log transform (exponentiate) and subtract 1 before submitting predictions!
#write the predictions to a csv file
make_predictions <- function(model, write = F){
output_path <- paste0("Predictions/",as.character(model$method)," Predictions - ", format(Sys.time(), '%d %B %Y'), ".txt")
pred <- (exp(predict.train(model, holdout_x)) - 1)
if(write == T){
data_frame(ID = test_raw$ID, Amount = pred) %>%
write_tsv(., path = output_path)
} else{return(pred)}
}
xgb_predictions <- make_predictions(xgb_tune2)#This was used as the final model
glm_predictions <- make_predictions(GLM_best)
#write predictions to file
make_predictions(xgb_tune2, write = T)
importance_df <- data_frame(Feature = rownames(varImp(xgb_tune2)$importance), Importance = varImp(xgb_tune2)$importance$Overall) %>% arrange(Importance)
importance_order <- importance_df$Feature
importance_df %>%
mutate(Feature = fct_relevel(Feature, importance_order)) %>%
ggplot(aes(Feature, Importance)) +
geom_bar(stat = "identity", fill = "dodgerblue") +
coord_flip()
varImp(xgb_tune2)
## xgbTree variable importance
##
## only 20 most important variables shown (out of 32)
##
## Overall
## V2 100.000
## V1 31.485
## V7 31.155
## V20 26.315
## V5 22.344
## V6 7.968
## V3 6.595
## V4 5.478
## V23 3.567
## V27 2.876
## V8 2.853
## V21 2.641
## V26 2.360
## V22 2.333
## V28 2.152
## V16 2.101
## V15 1.981
## V9 1.778
## V10 1.715
## V18 1.685
top_5_corr_with_amount
## [1] "V2" "V5" "V6" "V20" "V16"
We can look at the partial dependence plots.
make_gg_partial <- function(input_feature){
sample_index <- base::sample(x = nrow(train_x), size = 50000)
partial_dep_data <- pdp::partial(xgb_tune2, pred.var = c(input_feature), train = train_x[sample_index,])
partial_dep_data %>%
ggplot(aes_string(input_feature, "yhat")) +
geom_line() +
geom_point(colour = "dodgerblue")
}
V1_partial <- make_gg_partial("V1") + xlim(-4, 2.5)
V2_partial <- make_gg_partial("V2") + xlim(-4, 5)
V7_partial <- make_gg_partial("V7") + xlim(-4, 5)
V20_partial <- make_gg_partial("V20") + xlim(-4, 5)
V5_partial <- make_gg_partial("V5") + xlim(-4, 5)
ggarrange(V1_partial, V2_partial, V7_partial, V20_partial, V5_partial)
Reference Texts
R Software Packages
Online Articles (more than can be listed)
- Generalized Linear Models for Insurance Ratemaking
- A Gentle Introduction to XGBoost for Applied Machine Learning
- An End-to-End Guide to Understanding XGBoost
- Various Kaggle repositories